LLM harmfulness detection
A generative approach to LLM harmfulness detection with special red flag tokens
Xhonneux, Sophie, Dobre, David, Mofakhami, Mehrnaz, Schwinn, Leo, Gidel, Gauthier
Most fine-tuning-based safety training methods for large language models (LLMs) rely on dramatically changing the model's output distribution when it is faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and can leave auto-regressive models vulnerable to attacks that raise the likelihood of an initial affirmative-response token. To avoid this, we propose expanding the model's vocabulary with a special token we call the red flag token.
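As a rough illustration of the vocabulary-expansion step described above, the sketch below adds a special token to a Hugging Face tokenizer/model pair and resizes the embeddings so the new token has trainable weights. This is not the authors' code: the token string "<red-flag>" and the checkpoint name are illustrative assumptions, and the actual fine-tuning objective that teaches the model when to emit the token is not shown.

```python
# Minimal sketch (assumptions noted above): registering a special red flag token
# with a Hugging Face causal LM so it can later be fine-tuned to emit that token
# when harmful content is being, or is about to be, generated.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint; any causal LM would work
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Expand the vocabulary with the new special token.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<red-flag>"]}
)

# Resize the embedding (and tied LM head) matrices so the token gets weights.
model.resize_token_embeddings(len(tokenizer))

rf_id = tokenizer.convert_tokens_to_ids("<red-flag>")
print(f"Added {num_added} special token(s); red flag token id = {rf_id}")

# At inference time, the probability assigned to rf_id at each step could be
# monitored as a generative harmfulness signal, per the approach in the abstract.
```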